The Data Science Workflow III – Model
Charlotte Fresenius Privatuniversität
April 30, 2025
The advertising data set contains sales of a product in 200 different markets, along with advertising budgets for three different media: social_media, radio and newspaper. We denote by \(X_1\) the social_media budget, \(X_2\) the radio budget and \(X_3\) the newspaper budget. The advertising budgets are input variables, while sales is an output variable:
Clearly, the best fit is achieved by the blue curve: the orange (linear) one is too inflexible, and the green one is too flexible (too wiggly).
The comparison of train and test MSE as a function of flexibility reveals an interesting pattern:
This is a fundamental property of statistical learning that holds regardless of data set and statistical method:
As model flexibility increases, the training MSE will decrease, but the test MSE may not.
Let’s see two more examples.
Below, we decompose the three MSE curves from earlier into their three constituent parts: the variance of \(\hat{f}\), its squared bias and \(\text{Var}(\epsilon)\), the irreducible error (represented by the dashed line):
Suppose we wanted to predict whether an individual will default on his or her credit card payment on the basis of annual income (\(X_1\)) and monthly credit card balance (\(X_2\)). Then we might use the \(x\) and the \(y\) axis to display the two predictors and colour to indicate default (orange) or no default (blue), like so:
Suppose we are only interested in the relationship between the social_media budget and sales. Our goal is to obtain estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\), such that \(y_i \approx \hat{\beta}_0 + \hat{\beta}_1x_i\) for \(i = 1, \ldots, n\). There are multiple ways to do this.
Using least squares on the advertising data, where social_media is \(X\) and sales is \(Y\), we get: \[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot X = 7.0326 + 0.0475 \cdot X\]

However, this is a data science course and not a statistics course, so we will not compute these quantities by hand. Instead, we want to use R for all this.
To estimate a linear regression model in R, we need two things to pass into the function lm (which stands for “linear model”):
- a data.frame holding the variables we want included in the model,
- a formula that specifies the model; in simple linear regression, this will take the form of response ~ predictor.

So, in our advertising example, we would do the following:
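A minimal sketch of that call, assuming the advertising data.frame from above is loaded (the object names simple_reg and simple_reg_summary match those used later):

```r
# Fit the simple linear regression of sales on social_media
simple_reg <- lm(sales ~ social_media, data = advertising)

# Store the model summary for later use
simple_reg_summary <- summary(simple_reg)
```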
We can then have a look at the result using the summary function:
Call:
lm(formula = sales ~ social_media, data = advertising)
Residuals:
Min 1Q Median 3Q Max
-8.3860 -1.9545 -0.1913 2.0671 7.2124
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.032594 0.457843 15.36 <2e-16 ***
social_media 0.047537 0.002691 17.67 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
This output contains a lot of useful information. First of all:
- Call: summary of the estimated regression model
- Residuals: a five-point summary of the residuals \(e_i\)
- Coefficients: a table with the following columns:
  - the coefficient names ((Intercept) and social_media)
  - Estimate: the OLS estimate of the coefficient, i.e. \(\hat{\beta}_i\) for \(i = 0, 1\)
  - Std. Error: the estimated standard error of the coefficient, \(\hat{\sigma}_{\hat{\beta}_i}\)
  - t value: the test statistic \(t\) for the two-sided hypothesis test against 0, i.e. to test the null hypothesis \(H_0: \beta_i = 0\) against the alternative \(H_1: \beta_i \neq 0\)
  - Pr(>|t|): the \(p\)-value of the corresponding hypothesis test
- Residual standard error: an estimate of the square root of \(\text{Var}(\epsilon)\)
- Multiple R-squared: \(R^2\)
- Adjusted R-squared / F-statistic: see the example with multiple predictors.

From the created model object simple_reg, we can extract:
- the estimated coefficients via coef,
- the fitted values via fitted,
- the residuals via residuals:

          1          2          3          4          5          6
  4.1292255  1.2520260  1.4497762  4.2656054 -2.7272181 -0.2461623

- the residual standard error via sigma,
- confidence intervals for the coefficients via confint and its level argument.

Additionally, we can extract the following quantities from the created model summary object simple_reg_summary:
- the full coefficient matrix via coef:

                   Estimate   Std. Error  t value     Pr(>|t|)
  (Intercept)  7.03259355 0.457842940 15.36028 1.40630e-35
  social_media 0.04753664 0.002690607 17.66763 1.46739e-42

- \(R^2\) via the r.squared field of the corresponding list.

If we are interested in prediction rather than inference, we need a way to predict new data points. In R, this functionality is provided by the predict function.
It requires the estimated model we want to use for prediction and the values of the features for which predictions should be generated in a data.frame.
Suppose, we want to predict sales for social_media budgets of 100 and 200:
Note that the names of the features (here only social_media) in newdata have to be identical to their names in the training data:
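A sketch of such a prediction call:

```r
# Predict sales for social_media budgets of 100 and 200
predict(simple_reg, newdata = data.frame(social_media = c(100, 200)))
# With the estimates above, this is roughly 7.0326 + 0.0475 * c(100, 200),
# i.e. about 11.79 and 16.54
```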
Now, we have all we need to estimate the test MSE of our model using K-fold cross validation! Let’s say for \(K = 5\).
Let’s go step by step: first, we randomly assign each observation to one of the 5 folds. The randomness is once again introduced via the function sample:
n <- nrow(advertising)
K <- 5
advertising$fold <- sample(rep(1:K, length.out = n))
head(advertising, 10)

   social_media radio newspaper sales fold
1 230.1 37.8 69.2 22.1 1
2 44.5 39.3 45.1 10.4 1
3 17.2 45.9 69.3 9.3 5
4 151.5 41.3 58.5 18.5 5
5 180.8 10.8 58.4 12.9 3
6 8.7 48.9 75.0 7.2 4
7 57.5 32.8 23.5 11.8 1
8 120.2 19.6 11.6 13.2 3
9 8.6 2.1 1.0 4.8 3
10 199.8 2.6 21.2 10.6 2
Then, for each fold \(k\):

- We fit the model on all folds except the \(k\)th and use it to predict sales on the \(k\)th fold.
- Having predicted sales for all observations, we can then compute the MSE on the \(k\)th fold.

mses_linear <- numeric(K)
for(k in 1:K){
m <- lm(sales ~ social_media, data = subset(advertising, fold != k))
preds <- predict(m, newdata = subset(advertising, fold == k))
mse <- mean((preds - advertising$sales[advertising$fold == k])^2)
mses_linear[k] <- mse
}
mses_linear

[1]  7.03702 10.98168 12.21465 10.85549 12.54239
Finally, we can compute the cross-validated MSE by calculating the mean of the MSEs across the 5 folds:
Since the MSE lives on a squared scale (due to the squaring in its computation), we can bring the scale back to the original linear scale by taking the square root. The result is called the root mean squared error (RMSE):
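A sketch of both computations, using the per-fold MSEs stored in mses_linear above:

```r
cv_mse  <- mean(mses_linear)  # cross-validated MSE (about 10.73 here)
cv_rmse <- sqrt(cv_mse)       # RMSE, back on the original scale (about 3.28)
```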
In and of themselves, these numbers do not tell us too much yet. However, we will be comparing them against the cross-validated (R)MSE of other methods, to see which one is the best.
The term linear in “linear regression” actually refers only to linearity in parameters \(\beta_0\) and \(\beta_1\).
We can therefore easily incorporate non-linearities by appropriately defining the dependent variable \(Y\) and the independent variable \(X\). For example:
Let’s estimate each of these models in our advertising example and assess their test MSEs using 5-fold cross validation again. First, we add the log and square root of the relevant variable to our data set:
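The transformed variables can be added like this (the column names match those used in the cross-validation loop below):

```r
# Add log- and square-root-transformed versions of the relevant variables
advertising$log_sales         <- log(advertising$sales)
advertising$log_social_media  <- log(advertising$social_media)
advertising$sqrt_social_media <- sqrt(advertising$social_media)
```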
Now, let’s assess the test MSE of each of the three above models using 5-fold cross validation:
mses_nonlinear <- matrix(0, K, 3)
for(k in 1:K){
m1 <- lm(log_sales ~ social_media, data = subset(advertising, fold != k))
m2 <- lm(log_sales ~ log_social_media, data = subset(advertising, fold != k))
m3 <- lm(sales ~ sqrt_social_media, data = subset(advertising, fold != k))
preds1 <- exp(predict(m1, newdata = subset(advertising, fold == k)))
preds2 <- exp(predict(m2, newdata = subset(advertising, fold == k)))
preds3 <- predict(m3, newdata = subset(advertising, fold == k))
mses_nonlinear[k, 1] <- mean((preds1 - advertising$sales[advertising$fold == k])^2)
mses_nonlinear[k, 2] <- mean((preds2 - advertising$sales[advertising$fold == k])^2)
mses_nonlinear[k, 3] <- mean((preds3 - advertising$sales[advertising$fold == k])^2)
}
colMeans(mses_nonlinear)

[1] 11.83003 10.71977 10.39436
The model \(Y = \beta_0 + \beta_1 \cdot \sqrt{X}\) yields the lowest test MSE of the three, and it is also lower than that of the linear model, albeit not by much.
However, sales will probably also depend on the radio and newspaper advertising budget. Estimating a multiple linear regression where sales is \(Y\), social_media is \(X_1\), radio is \(X_2\) and newspaper is \(X_3\), we get: \[\hat{Y} = 2.9389 + 0.0458 \cdot X_1 + 0.1885 \cdot X_2 - 0.0010 \cdot X_3\]

This model fits the data considerably better:

- The simple regression of sales on social_media had an adjusted \(R^2\) of 0.6099 (see corresponding R output from earlier).
- The regression of sales on all three advertising budgets has an adjusted \(R^2\) of 0.8956 (see upcoming R output).

However, again, this course focuses on how to use R to estimate these types of models and compute these types of statistics.
To estimate a multiple linear regression in R, we simply combine the predictors on the right-hand side of the ~ using a + sign, i.e. response ~ predictor1 + predictor2 + ... + predictorp.
So, to estimate the model with all three predictors in our advertising example, we would run the following line of code:
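A sketch of that call (the object name multi_reg is an assumption, chosen here for illustration):

```r
# Multiple linear regression of sales on all three advertising budgets
multi_reg <- lm(sales ~ social_media + radio + newspaper, data = advertising)
```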
Fortunately, many of the other functionalities we have seen in simple linear regression neatly generalize to the multiple linear regression case!
Again, we inspect the result using the summary function:
Call:
lm(formula = sales ~ social_media + radio + newspaper, data = advertising)
Residuals:
Min 1Q Median 3Q Max
-8.8277 -0.8908 0.2418 1.1893 2.8292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 <2e-16 ***
social_media 0.045765 0.001395 32.809 <2e-16 ***
radio 0.188530 0.008611 21.893 <2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
From the output, we see:

- All predictors except newspaper have a significant influence on sales at the 0.1% significance level.
- Almost 90% of the variance in sales can be explained with this model.

But how does its test MSE hold up against the models estimated so far? Let's investigate this question using our cross-validation routine again:
mses_big <- numeric(K)
for(k in 1:K){
m <- lm(sales ~ social_media + radio + newspaper,
data = subset(advertising, fold != k))
preds <- predict(m, newdata = subset(advertising, fold == k))
mse <- mean((preds - advertising$sales[advertising$fold == k])^2)
mses_big[k] <- mse
}
results <- data.frame(model_id = factor(1:5),
                      model_names = model_names,  # character vector of model labels
                      test_mse = c(mean(mses_linear),
                                   colMeans(mses_nonlinear),
                                   mean(mses_big)))
knitr::kable(results[,-1], col.names = c("Model", "Test MSE"),
             digits = 2, row.names = TRUE)

|   | Model | Test MSE |
|---|---|---|
| 1 | \(Y \sim X_1\) | 10.73 |
| 2 | \(\log(Y) \sim X_1\) | 11.83 |
| 3 | \(\log(Y) \sim \log(X_1)\) | 10.72 |
| 4 | \(Y \sim \sqrt{X_1}\) | 10.39 |
| 5 | \(Y \sim X_1 + X_2 + X_3\) | 3.06 |
Unsurprisingly, the model that uses all three advertising budgets as predictors is much better than any of the models that use just the social media budget!
ggplot(results, aes(x = model_id, y = test_mse, fill = model_id)) +
geom_bar(stat = "identity") +
labs(x = "Model ID", y = "Test MSE",
title = "Test MSE of the linear models estimated",
subtitle = "Test MSE determined via 5-fold cross validation",
caption = "Source: advertising data") +
guides(fill = "none") +
scale_fill_brewer(palette = "Set2") +
  theme_minimal()

So far, all the predictors we have looked at have been quantitative variables. In practice, we often have to deal with categorical predictors as well.
To see an example, let’s consider the Credit data set from the ISLR package. It contains socio-economic information on 400 credit card customers.
The data set contains seven quantitative variables:
library(ISLR)
head(Credit[, c("Income", "Limit", "Rating", "Cards", "Age", "Education", "Balance")])

   Income Limit Rating Cards Age Education Balance
1 14.891 3606 283 2 34 11 333
2 106.025 6645 483 3 82 15 903
3 104.593 7075 514 4 71 11 580
4 148.924 9504 681 3 36 11 964
5 55.882 4897 357 2 68 16 331
6 80.180 8047 569 4 77 10 1151
The goal here is to build a predictive model for Balance, the average credit card balance of a given customer.
However, aside from the quantitative variables, the data set also contains four qualitative (categorical) variables:
So how can we include the information in these qualitative predictors to benefit our linear model for balance?
The answer lies in dummy variables, i.e. variables that only take on the values 0 or 1. For a categorical variable with \(K\) levels, we create \(K-1\) dummy variables. Let’s see an example of how this works.
The variable Ethnicity takes on \(K = 3\) different values:
So, we can encode the information in the Ethnicity variable in \(K - 1 = 2\) dummy variables, called \(d_1\) and \(d_2\). For example, one could be used to indicate Asian ethnicity and the other to indicate Caucasian: \[d_{1i} = \begin{cases}
1 & \text{if person } i \text{ is of Asian ethnicity} \\
0 & \text{otherwise}
\end{cases}\] \[d_{2i} = \begin{cases}
1 & \text{if person } i \text{ is of Caucasian ethnicity} \\
0 & \text{otherwise}
\end{cases}\] Note that a third dummy variable is not necessary: a person with \(d_{1i} = d_{2i} = 0\) is thereby identified as African American, the reference category.
We could manually add these dummy variables to the data set:
Credit$Asian <- ifelse(Credit$Ethnicity == "Asian", 1, 0)
Credit$Caucasian <- ifelse(Credit$Ethnicity == "Caucasian", 1, 0)
head(Credit[,c("Ethnicity", "Asian", "Caucasian")], 8)

          Ethnicity Asian Caucasian
1 Caucasian 0 1
2 Asian 1 0
3 Asian 1 0
4 Asian 1 0
5 Caucasian 0 1
6 Caucasian 0 1
7 African American 0 0
8 Asian 1 0
With the categorical data now numerically encoded, we can estimate the following model: \[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 d_{1i} + \hat{\beta}_2 d_{2i},\] where \(y_i\) is the \(i\)th observation of the balance variable.
Let’s estimate this model with R. Fortunately, as long as the categorical variables are properly encoded as factors, we do not have to create the dummy variables ourselves. Let’s check:
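One way to check the encoding (a sketch; assumes the Credit data from the ISLR package is loaded):

```r
# Verify that Ethnicity is a factor and inspect its levels;
# the first level will serve as the reference category
is.factor(Credit$Ethnicity)
levels(Credit$Ethnicity)  # "African American" "Asian" "Caucasian"
```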
This looks good! Note that R will always automatically use the first factor level as the reference category when estimating a linear model with factors:
Note that with the proper factor encoding of Ethnicity, we do not need to make use of the dummy variables Asian and Caucasian we manually added to the data set.
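The model can thus be fit directly with the factor variable (the object name credit_model1 is an assumption, chosen for illustration):

```r
# Regress Balance on the categorical predictor Ethnicity
credit_model1 <- lm(Balance ~ Ethnicity, data = Credit)
summary(credit_model1)
```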
Let’s have a look at the created model object:
Call:
lm(formula = Balance ~ Ethnicity, data = Credit)
Residuals:
Min 1Q Median 3Q Max
-531.00 -457.08 -63.25 339.25 1480.50
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 531.00 46.32 11.464 <2e-16 ***
EthnicityAsian -18.69 65.02 -0.287 0.774
EthnicityCaucasian -12.50 56.68 -0.221 0.826
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 460.9 on 397 degrees of freedom
Multiple R-squared: 0.0002188, Adjusted R-squared: -0.004818
F-statistic: 0.04344 on 2 and 397 DF, p-value: 0.9575
This shows that, on average, African Americans (the reference category) have a balance of 531.00, Asians a balance of 531.00 − 18.69 = 512.31 and Caucasians a balance of 531.00 − 12.50 = 518.50.
However, as the regression output shows, these differences in average credit card balances by ethnicity are clearly not significant.
Note that the group averages inferred from the regression coefficients indeed correspond to the group averages:
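One way to verify this (a sketch):

```r
# Mean Balance per Ethnicity group
tapply(Credit$Balance, Credit$Ethnicity, mean)
# Expect roughly 531.00 (African American), 512.31 (Asian), 518.50 (Caucasian),
# matching the intercept and coefficient differences above
```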
Based on this approach of dummy variable encoding, we can include any number of categorical variables into the regression equation.
As an example, the next slide displays the regression output from using all available predictors (quantitative and qualitative) to model credit card balance in the usual additive fashion.
credit_model2 <- lm(Balance ~ Income + Limit + Rating + Cards + Age + Education +
Gender + Student + Married + Ethnicity,
data = Credit)
summary(credit_model2)
Call:
lm(formula = Balance ~ Income + Limit + Rating + Cards + Age +
Education + Gender + Student + Married + Ethnicity, data = Credit)
Residuals:
Min 1Q Median 3Q Max
-161.64 -77.70 -13.49 53.98 318.20
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -479.20787 35.77394 -13.395 < 2e-16 ***
Income -7.80310 0.23423 -33.314 < 2e-16 ***
Limit 0.19091 0.03278 5.824 1.21e-08 ***
Rating 1.13653 0.49089 2.315 0.0211 *
Cards 17.72448 4.34103 4.083 5.40e-05 ***
Age -0.61391 0.29399 -2.088 0.0374 *
Education -1.09886 1.59795 -0.688 0.4921
GenderFemale -10.65325 9.91400 -1.075 0.2832
StudentYes 425.74736 16.72258 25.459 < 2e-16 ***
MarriedYes -8.53390 10.36287 -0.824 0.4107
EthnicityAsian 16.80418 14.11906 1.190 0.2347
EthnicityCaucasian 10.10703 12.20992 0.828 0.4083
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 98.79 on 388 degrees of freedom
Multiple R-squared: 0.9551, Adjusted R-squared: 0.9538
F-statistic: 750.3 on 11 and 388 DF, p-value: < 2.2e-16
With over 95% variance explained, this is a very good model for credit card balance. However, is it reasonable to leave all of the predictors in the model? Or are we better off removing some of them?
In R, forward selection can be performed with the step function. It requires the null model (as a lower bound) and the full model with all variables (as an upper bound) as input.

m0 <- lm(Balance ~ 1, data = Credit)
credit_model_fw <- step(m0, scope = Balance ~ Income + Limit + Rating + Cards + Age +
Education + Gender + Student + Married + Ethnicity,
                           direction = "forward")

Start: AIC=4905.56
Balance ~ 1
Df Sum of Sq RSS AIC
+ Rating 1 62904790 21435122 4359.6
+ Limit 1 62624255 21715657 4364.8
+ Income 1 18131167 66208745 4810.7
+ Student 1 5658372 78681540 4879.8
+ Cards 1 630416 83709496 4904.6
<none> 84339912 4905.6
+ Gender 1 38892 84301020 4907.4
+ Education 1 5481 84334431 4907.5
+ Married 1 2715 84337197 4907.5
+ Age 1 284 84339628 4907.6
+ Ethnicity 2 18454 84321458 4909.5
Step: AIC=4359.63
Balance ~ Rating
Df Sum of Sq RSS AIC
+ Income 1 10902581 10532541 4077.4
+ Student 1 5735163 15699959 4237.1
+ Age 1 649110 20786012 4349.3
+ Cards 1 138580 21296542 4359.0
+ Married 1 118209 21316913 4359.4
<none> 21435122 4359.6
+ Education 1 27243 21407879 4361.1
+ Gender 1 16065 21419057 4361.3
+ Limit 1 7960 21427162 4361.5
+ Ethnicity 2 51100 21384022 4362.7
Step: AIC=4077.41
Balance ~ Rating + Income
Df Sum of Sq RSS AIC
+ Student 1 6305322 4227219 3714.2
+ Married 1 95068 10437473 4075.8
+ Limit 1 94545 10437996 4075.8
+ Age 1 90286 10442255 4076.0
<none> 10532541 4077.4
+ Education 1 20819 10511722 4078.6
+ Ethnicity 2 67040 10465501 4078.9
+ Cards 1 2094 10530447 4079.3
+ Gender 1 948 10531593 4079.4
Step: AIC=3714.24
Balance ~ Rating + Income + Student
Df Sum of Sq RSS AIC
+ Limit 1 194718 4032502 3697.4
+ Age 1 44620 4182600 3712.0
<none> 4227219 3714.2
+ Married 1 13083 4214137 3715.0
+ Gender 1 12168 4215051 3715.1
+ Cards 1 10608 4216611 3715.2
+ Education 1 1400 4225820 3716.1
+ Ethnicity 2 22322 4204897 3716.1
Step: AIC=3697.37
Balance ~ Rating + Income + Student + Limit
Df Sum of Sq RSS AIC
+ Cards 1 166410 3866091 3682.5
+ Age 1 37952 3994549 3695.6
<none> 4032502 3697.4
+ Gender 1 13345 4019157 3698.0
+ Married 1 6660 4025842 3698.7
+ Education 1 5795 4026707 3698.8
+ Ethnicity 2 17704 4014797 3699.6
Step: AIC=3682.52
Balance ~ Rating + Income + Student + Limit + Cards
Df Sum of Sq RSS AIC
+ Age 1 44472 3821620 3679.9
<none> 3866091 3682.5
+ Gender 1 11350 3854741 3683.3
+ Education 1 5672 3860419 3683.9
+ Married 1 3121 3862970 3684.2
+ Ethnicity 2 14756 3851335 3685.0
Step: AIC=3679.89
Balance ~ Rating + Income + Student + Limit + Cards + Age
Df Sum of Sq RSS AIC
<none> 3821620 3679.9
+ Gender 1 10860.9 3810759 3680.7
+ Married 1 5450.6 3816169 3681.3
+ Education 1 5241.7 3816378 3681.3
+ Ethnicity 2 11517.3 3810102 3682.7
The final model found with forward selection contains predictors Rating, Income, Student, Limit, Cards and Age. It has an AIC of 3679.9. The object credit_model_fw can be used like any other lm object.
Backward selection also uses the step function. It only requires the full model (as an upper bound) and the direction "backward" as input.

Start: AIC=3686.22
Balance ~ Income + Limit + Rating + Cards + Age + Education +
Gender + Student + Married + Ethnicity
Df Sum of Sq RSS AIC
- Ethnicity 2 14084 3800814 3683.7
- Education 1 4615 3791345 3684.7
- Married 1 6619 3793349 3684.9
- Gender 1 11269 3798000 3685.4
<none> 3786730 3686.2
- Age 1 42558 3829288 3688.7
- Rating 1 52314 3839044 3689.7
- Cards 1 162702 3949432 3701.0
- Limit 1 331050 4117780 3717.7
- Student 1 6326012 10112742 4077.1
- Income 1 10831162 14617892 4224.5
Step: AIC=3683.7
Balance ~ Income + Limit + Rating + Cards + Age + Education +
Gender + Student + Married
Df Sum of Sq RSS AIC
- Married 1 4545 3805359 3682.2
- Education 1 4757 3805572 3682.2
- Gender 1 10760 3811574 3682.8
<none> 3800814 3683.7
- Age 1 45650 3846464 3686.5
- Rating 1 49473 3850287 3686.9
- Cards 1 166806 3967621 3698.9
- Limit 1 340250 4141064 3716.0
- Student 1 6372573 10173387 4075.5
- Income 1 10838891 14639705 4221.1
Step: AIC=3682.18
Balance ~ Income + Limit + Rating + Cards + Age + Education +
Gender + Student
Df Sum of Sq RSS AIC
- Education 1 5399 3810759 3680.7
- Gender 1 11019 3816378 3681.3
<none> 3805359 3682.2
- Age 1 43545 3848904 3684.7
- Rating 1 46929 3852289 3685.1
- Cards 1 170729 3976088 3697.7
- Limit 1 352112 4157472 3715.6
- Student 1 6461978 10267338 4077.2
- Income 1 10860901 14666261 4219.8
Step: AIC=3680.75
Balance ~ Income + Limit + Rating + Cards + Age + Gender + Student
Df Sum of Sq RSS AIC
- Gender 1 10861 3821620 3679.9
<none> 3810759 3680.7
- Age 1 43983 3854741 3683.3
- Rating 1 49520 3860279 3683.9
- Cards 1 170898 3981656 3696.3
- Limit 1 347654 4158413 3713.7
- Student 1 6472072 10282831 4075.8
- Income 1 10855780 14666539 4217.8
Step: AIC=3679.89
Balance ~ Income + Limit + Rating + Cards + Age + Student
Df Sum of Sq RSS AIC
<none> 3821620 3679.9
- Age 1 44472 3866091 3682.5
- Rating 1 49263 3870883 3683.0
- Cards 1 172930 3994549 3695.6
- Limit 1 347898 4169518 3712.7
- Student 1 6462599 10284218 4073.9
- Income 1 10845018 14666638 4215.8
The final model found with backward selection also contains predictors Rating, Income, Student, Limit, Cards and Age. With an AIC of 3679.9, it is the same model that was found with forward selection. The object credit_model_bw can be used like any other lm object.
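The backward-selection trace above can be produced with a call like the following (a sketch; credit_model2 is the full model fitted earlier):

```r
# Backward selection: start from the full model and drop predictors by AIC
credit_model_bw <- step(credit_model2, direction = "backward")
```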
Plots of \(\hat{f}(X)\) using KNN regression on a two-dimensional data set with 64 observations (orange dots). Left: \(K = 1\) results in a rough step function fit. Right: \(K = 9\) produces a much smoother fit.
Plots of \(\hat{f}(X)\) using KNN regression on a one-dimensional data set with 50 observations. Clearly, when the true relationship (black line) is linear, KNN regression is sub-optimal, regardless of whether \(K = 1\) (left) or \(K = 9\) (right).
For a non-linear relationship between \(X\) and \(Y\), however, KNN outperforms linear regression. On the left, we see KNN fits for \(K = 1\) (blue) and \(K = 9\) (red) compared to the true relationship (black). On the right, we see that the test set MSE for linear regression (horizontal black line) is significantly higher than the test set MSE for KNN with different values of \(K\) (displayed as \(1/K\)).
Note that KNN is sensitive to the scale of the features. Consider two features salary (in USD) and age (in years): for the KNN, a difference of 1000 USD in salary is enormous compared to a difference of 50 years in age. The distance between observations would thus be driven almost entirely by salary rather than age.

In R, KNN regression can be achieved with the help of the FNN package. It provides a function called knn.reg, which requires a data frame of features, a vector of responses and a value for \(K\) to estimate a KNN model. Let's apply that to our advertising example, say for \(K = 3\):
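A sketch of that call, using the three budget columns as features (when no test set is supplied, knn.reg evaluates the fit via leave-one-out cross validation, which produces the PRESS and R2-Predict values shown below):

```r
library(FNN)

# KNN regression with K = 3 on the three advertising budgets
knn3 <- knn.reg(train = advertising[, c("social_media", "radio", "newspaper")],
                y = advertising$sales, k = 3)
knn3
```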
PRESS = 373.3867
R2-Predict = 0.9310732
With the knn.reg function, we can do fitting and predicting in one function call. This requires passing the prediction features as the test argument to the function. We use this functionality in our existing cross-validation scheme:
n_neighbours <- 1:10 # number of neighbours considered
mses_knn <- matrix(NA, K, length(n_neighbours))
colnames(mses_knn) <- sprintf("K = %d", n_neighbours)
for(k in 1:K){
preds <- lapply(n_neighbours, function(x) knn.reg(train = subset(advertising, fold != k, 1:3),
test = subset(advertising, fold == k, 1:3),
y = advertising$sales[advertising$fold != k],
k = x)$pred)
mses_knn[k, ] <- colMeans((do.call(cbind, preds) - advertising$sales[advertising$fold == k])^2)
}
round(colMeans(mses_knn), 2)

 K = 1  K = 2  K = 3  K = 4  K = 5  K = 6  K = 7  K = 8  K = 9 K = 10
2.33 1.94 2.27 2.37 2.49 2.68 2.92 3.24 3.61 3.73
The best test MSE is achieved by setting \(K = 2\). The resulting model has a test MSE of 1.94, which is also considerably better than the best-performing linear model, which had test MSE 3.06.
In the introduction, we already saw that a common way of evaluating classifiers is the error rate, i.e. the proportion of misclassified observations.
However, there are many more ways to evaluate classifiers. One typically looks at the joint distribution of observed response and predicted response.
For binary classification, we can illustrate this distribution with the help of a so-called confusion matrix:
|   | observed/true = 0 | observed/true = 1 |
|---|---|---|
| predicted = 0 | true negatives (TN) | false negatives (FN) |
| predicted = 1 | false positives (FP) | true positives (TP) |
For an extended version of this table, see here.
Suppose we have a data set of 100 credit card customers, of which 5 have defaulted on their credit card debt. We want to build a classifier that tells us whether someone will default or not.
After training, the classifier produced the following confusion matrix:
|   | observed/true = 0 | observed/true = 1 | Total |
|---|---|---|---|
| predicted = 0 | 94 | 4 | 98 |
| predicted = 1 | 1 | 1 | 2 |
| Total | 95 | 5 | 100 |
Accuracy can be misleading for unbalanced class distributions!
Let’s visualize the relationship between area_income and the response clicked_on_ad in a scatter plot (with a slight jitter):
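Such a plot might be created like this (a sketch; the jitter amount is an arbitrary choice):

```r
library(ggplot2)

# Scatter plot of the binary response against area_income,
# with vertical jitter so overlapping points become visible
ggplot(click, aes(x = area_income, y = clicked_on_ad)) +
  geom_jitter(height = 0.05, alpha = 0.5) +
  labs(x = "Area income", y = "Clicked on ad") +
  theme_minimal()
```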
Let’s finally see an example of how to do all this in R!
To estimate a logistic regression in R, we need three things to pass into the function glm (which stands for “generalized linear model”):
- a data.frame holding the variables we want included in the model, especially the binary response as a factor or numeric,
- a formula that specifies the model, similar to the lm case,
- a family that specifies the type of generalized linear model; for logistic regression, this is always equal to binomial().

So, if we wanted to use area_income and age as predictors for the response clicked_on_ad, we would do the following:
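In code (logreg is the object name used in the calls further below):

```r
# Logistic regression of clicked_on_ad on area_income and age
logreg <- glm(clicked_on_ad ~ area_income + age, family = binomial(), data = click)
```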
We can then have a look at the result using the summary function:
Call:
glm(formula = clicked_on_ad ~ area_income + age, family = binomial(),
data = click)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.160e-02 5.399e-01 0.17 0.865
area_income -1.033e-04 8.207e-06 -12.59 <2e-16 ***
age 1.626e-01 1.261e-02 12.90 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1386.3 on 999 degrees of freedom
Residual deviance: 880.0 on 997 degrees of freedom
AIC: 886
Number of Fisher Scoring iterations: 5
While the output looks somewhat different to the linear regression case, the all-important coefficients matrix with Estimate, Std. Error, z value and Pr(>|z|) looks virtually identical.
Writing \(X_1\) for area_income (in thousand USD) and \(X_2\) for age, the estimated model for the click probability \(p\) thus has the following form: \[\log\left( \frac{p}{1 - p} \right) = 0.0916 - 0.1033 X_1 + 0.1626 X_2\]

We can interpret the coefficients as follows:

- Increasing area_income by 1000 USD decreases the log odds for clicking on the ad by 0.1033 on average.
- Increasing age by one year increases the log odds for clicking on the ad by 0.1626 on average.
- Both area_income and age seem to be important for predicting whether or not a user will click on the ad.

In general, there is a high degree of consistency in how we can operate on glm objects compared to lm objects. For example, many accessor functions work in the same way as before:
head(residuals(logreg))

         1          2          3          4          5          6
-0.9342594 -0.5191394 -0.5391085 -0.8425931 -0.5408458 -0.4287287

head(fitted(logreg))

        1         2         3         4         5         6
0.3536540 0.1260681 0.1352536 0.2988136 0.1360644 0.0878074

coef(logreg)

  (Intercept)   area_income           age
 0.0915995255 -0.0001032967  0.1626462670

confint(logreg)

                    2.5 %        97.5 %
(Intercept) -0.9649226490  1.154141e+00
area_income -0.0001198858 -8.768127e-05
age          0.1386819426  1.881543e-01
Due to the different scales involved (probabilities vs. log odds), some of these functions come with additional options, as we shall now see for prediction.
Predictions from the estimated model can again be generated with the predict function in R:
- If we want predictions on the log-odds scale, we set the argument type to link (which is the default):

new_click <- data.frame(area_income = 50000, age = 29)
predict(logreg, newdata = new_click, type = "link")

         1
-0.3564913
- If we want predictions on the probability scale, we set type to response.

Again, we randomly assign each observation to one of 5 folds:
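As before with the advertising data, a sketch:

```r
# Randomly assign each observation in click to one of 5 folds
K <- 5
click$fold <- sample(rep(1:K, length.out = nrow(click)))
```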
This time, we save the class predictions on each hold-out fold together into one variable called class_preds_lr. This will then allow us to evaluate the out-of-sample performance of the classifier:
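A sketch of this loop (the 0.5 probability threshold for turning predicted probabilities into class predictions is the usual default choice):

```r
click$class_preds_lr <- NA
for (k in 1:K) {
  # Fit on all folds except the kth
  m <- glm(clicked_on_ad ~ area_income + age, family = binomial(),
           data = subset(click, fold != k))
  # Predict probabilities on the held-out fold and threshold at 0.5
  probs <- predict(m, newdata = subset(click, fold == k), type = "response")
  click$class_preds_lr[click$fold == k] <- as.integer(probs > 0.5)
}
```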
Now, we can compute the out-of-sample confusion matrix of our classifier:
From this, we can now calculate the aforementioned performance metrics:
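Both steps might look as follows (a sketch; the metric definitions follow the confusion-matrix table above, though the exact set of metrics discussed on the slides may differ):

```r
# Out-of-sample confusion matrix
conf_mat <- table(predicted = click$class_preds_lr,
                  observed  = click$clicked_on_ad)
conf_mat

accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)       # proportion correct
error_rate  <- 1 - accuracy                              # proportion misclassified
sensitivity <- conf_mat["1", "1"] / sum(conf_mat[, "1"]) # TP / (TP + FN)
specificity <- conf_mat["0", "0"] / sum(conf_mat[, "0"]) # TN / (TN + FP)
```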
Note that the step function for forward and backward selection works equally on glm objects.

ggplot(click, aes(x = area_income, y = age, color = factor(clicked_on_ad))) +
geom_point() +
geom_abline(slope = -coef(logreg)[2]/coef(logreg)[3],
intercept = -coef(logreg)[1]/coef(logreg)[3]) +
labs(title = "Linear decision boundary of logistic regression",
subtitle = "Illustration of linear decision boundary estimated via logistic regression on click data",
color = "Clicked") +
theme_minimal() +
  scale_color_colorblind()

Suppose we have two classes (\(+\) and \(-\)) and two features \(X_1\) and \(X_2\). The figure below illustrates how conditional probability would be estimated for a point \(x\) and \(K=1,2,3\) in a toy example data set:
\[K = 1, \hat{P}(+|x) = 0/1 \quad K = 2, \hat{P}(+|x) = 1/2 \quad K = 3, \hat{P}(+|x) = 2/3\]
KNN classification is also provided by the FNN package:

- It provides a function called knn, whose arguments are very similar to the knn.reg function we saw earlier.
- Before applying it, we should standardize the features: area_income is measured in USD and age is measured in years. To standardize a variable, we deduct the mean and divide by the standard deviation.

Now, we can actually run the KNN classification for \(K=1, \ldots, 10\) and predict each fold in the 5-fold cross validation scheme from earlier:
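Before running the loop, the scaled feature columns it uses (ai_scaled, age_scaled) can be created like this, deducting the mean and dividing by the standard deviation:

```r
# Standardize area_income and age so both features contribute comparably
click$ai_scaled  <- (click$area_income - mean(click$area_income)) / sd(click$area_income)
click$age_scaled <- (click$age - mean(click$age)) / sd(click$age)
```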
n_neighbours <- 1:10
for(nn in n_neighbours){
knn_colname <- sprintf("class_preds_KNN%d", nn)
click[[knn_colname]] <- factor(NA, levels = c(0, 1))
for(k in 1:K){
preds <- knn(train = click[click$fold != k, c("ai_scaled", "age_scaled")],
test = click[click$fold == k, c("ai_scaled", "age_scaled")],
cl = click$clicked_on_ad[click$fold != k],
k = nn)
click[click$fold == k, knn_colname] <- preds
}
}

Let's compare the performance (in terms of classification accuracy on the test set) of these KNN models to the logistic regression model from earlier!
res_knn <- colMeans(as.matrix(click[, grep("_KNN", names(click), fixed = TRUE)]) == click$clicked_on_ad)
res_lr <- mean(click$class_preds_lr == click$clicked_on_ad)
res <- data.frame(model = c(sub("class_preds_", "", names(res_knn)), "LR"),
accuracy = c(res_knn, res_lr))
res$model <- factor(res$model, levels = res$model)
rownames(res) <- NULL
ggplot(res, aes(x = model, y = accuracy)) +
geom_bar(stat = "identity", fill = "deepskyblue4") +
labs(x = "Model", y = "Test accuracy",
title = "Test accuracy of the KNN models and logistic regression",
subtitle = "Test accuracy via 5-fold cross validation",
caption = "Source: click data") +
guides(fill = "none") +
  theme_minimal()

For the click data set, it seems that the true decision boundary is close to linear, so the additional flexibility of KNN does not provide any benefit.

The Hitters data set in the ISLR package contains statistics of 322 baseball players in the American MLB from the 1986 and 1987 seasons.

- We want to predict a player's Salary based on Years (the number of years in the MLB) and Hits (the number of hits he made in the previous year).
- We remove observations with missing Salary values and log transform the remaining salaries.

Suppose we want to divide the players into groups that, in terms of Hits and Years in the MLB, had similar salaries:
- Group 1: Years < 5
- Group 2: Years >= 5 and Hits < 118
- Group 3: Years >= 5 and Hits >= 118

We have already built our first regression tree! The reason for the name is that we can illustrate our decisions on the groups in a tree structure:
- Years < 5
- Years >= 5 and Hits < 118
- Years >= 5 and Hits >= 118

In the baseball example, we first consider all possible splits on Years:
Then, we run the same procedure for Hits to check if we get any lower RSS:
- For Years, the lowest RSS was achieved for \(s = 5\). The lowest achievable RSS was 115.06.
- For Hits, the lowest RSS was achieved for \(s = 118\). The lowest achievable RSS was 160.97.
- Therefore, the first split is on Years for \(s = 5\). We split the predictor space into two rectangles: one where Years < 5 and one where Years >= 5, both times irrespective of the number of Hits.
- Repeating the procedure, the second split divides the region where Years >= 5 into two smaller rectangles: one where Hits < 118 and one where Hits >= 118. This gives \(RSS = 91.33\).

With recursive binary splitting, the prediction surface of regression trees looks like an (irregular) staircase. Below is an example with five terminal nodes. Notice the difference to linear regression, which would be a plane.
Luckily, there is an R package called rpart that does most of this work for us. Let's have a look at it! The main function in the rpart package is also called rpart. Besides fitting a regression tree, it performs the cross-validation analysis for \(\alpha\) automatically in the background.
To use it, we need to pass it a model formula and data (like in lm) as well as the minimum \(\alpha\) value, which is referred to as cp (complexity parameter) in the package (0.01 by default).
Using all the variables in the baseball data, we can thus fit a “full” tree like this:
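Based on the rpart call recorded in the cross-validation output below, the fit can be written as follows (with cp = 0 so that the tree is grown out fully):

```r
library(rpart)

# Grow the full tree: all predictors except Salary itself,
# with cp = 0 so that no splits are penalized away during fitting
rt_full <- rpart(logSalary ~ . - Salary, data = Hitters,
                 control = list(cp = 0))
```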
The output from print(rt_full) is not very user-friendly, so we will visualize the tree using the rpart.plot package (result on the next slide):
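A minimal sketch of the visualization call (assuming the fitted tree is stored in rt_full):

```r
library(rpart.plot)  # plotting add-on for rpart trees

rpart.plot(rt_full)  # draws the tree with fitted values in the terminal nodes
```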
rpart has done the necessary cross-validation (10-fold by default) exercise during fitting. We can look at the result by calling printcp(rt_full), which produces the output below:
Regression tree:
rpart(formula = logSalary ~ . - Salary, data = Hitters, control = list(cp = 0))
Variables actually used in tree construction:
[1] AtBat CAtBat CHits CHmRun CRBI CRuns Hits PutOuts Walks
[10] Years
Root node error: 207.15/263 = 0.78766
n= 263
CP nsplit rel error xerror xstd
1 0.5689379 0 1.00000 1.00594 0.065303
2 0.0612877 1 0.43106 0.46493 0.052709
3 0.0577844 2 0.36977 0.44229 0.055377
4 0.0307862 3 0.31199 0.36508 0.061041
5 0.0219449 4 0.28120 0.34821 0.059012
6 0.0130968 5 0.25926 0.35770 0.060186
7 0.0117008 6 0.24616 0.36255 0.060416
8 0.0106994 7 0.23446 0.35751 0.059687
9 0.0082164 8 0.22376 0.35799 0.060102
10 0.0054925 9 0.21555 0.34999 0.058923
11 0.0052416 10 0.21005 0.34842 0.059175
12 0.0048933 14 0.18886 0.35715 0.062130
13 0.0043587 15 0.18397 0.35914 0.062019
14 0.0043174 17 0.17525 0.36168 0.062240
15 0.0032817 18 0.17094 0.36331 0.062404
16 0.0029170 19 0.16766 0.36551 0.064355
17 0.0021653 20 0.16474 0.36653 0.064503
18 0.0019773 21 0.16257 0.36330 0.064039
19 0.0000000 22 0.16060 0.36789 0.068038
- CP: The complexity parameter, a scaled version of \(\alpha\). In each row, it is the smallest penalty parameter for which the number of splits is still equal to nsplit, e.g. 0.06129 is the smallest value for which there is only one split.
- nsplit: The number of splits induced by the value of CP.
- rel error: Scaled training RSS. More precisely, it is \(RSS/TSS\) in-sample. It will always decrease as the number of splits (i.e. the size of the tree) increases.
- xerror: Scaled test RSS. More precisely, it is \(RSS/TSS\) out-of-sample (i.e. averaged across the hold-out folds). This is the primary metric for assessing the optimal number of splits.
- xstd: Standard deviation of xerror over the different folds.

(If you do not want to use ggplot2 to create this plot, a quick way to do it is to simply run the function rsq.rpart on our tree!)

df_error <- data.frame(nsplit = rep(rt_full$cptable[, 2], 2),
rss_scaled = as.vector(rt_full$cptable[, 3:4]),
type = rep(c("train", "test"), each = nrow(rt_full$cptable)))
ggplot(df_error, aes(x = nsplit, y = rss_scaled, group = type, color = type)) +
geom_vline(xintercept = 4, linetype = 2, linewidth = 0.25) +
geom_line() +
geom_point() +
labs(x = "Number of splits", y = "Scaled RSS (RSS/TSS)", color = "Sample",
title = "Using cost complexity pruning to choose the optimal number of splits",
subtitle = "Scaled train RSS and scaled test RSS (from 10-fold cross-validation) as a function of the number of splits") +
theme_minimal()

It seems like a good value for the number of splits in the baseball example is 4. Increasing the number of splits beyond that point does not deliver any (meaningful) reduction in the scaled test RSS.
Now, looking back to the table, we have to find the complexity parameter cp corresponding to nsplit = 4:
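Instead of reading the value off the printed table, we can extract it programmatically from the cptable matrix stored in the fitted object (a sketch, assuming the full tree is stored in rt_full):

```r
cpt <- rt_full$cptable

# Row where the cross-validated error (xerror) is smallest
cpt[which.min(cpt[, "xerror"]), ]

# Complexity parameter corresponding to nsplit = 4
best_cp <- cpt[cpt[, "nsplit"] == 4, "CP"]
best_cp
```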
It is 0.02194488. To finally optimally prune the tree, we now have to apply the prune function to our full tree rt_full together with the optimal cp parameter:
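The pruning step can then be sketched as follows (using the cp value identified above; rpart.plot is assumed to be loaded for the visualization):

```r
# Any cp between this row's CP and the next larger CP in the
# table yields the same 4-split tree
rt_pruned <- prune(rt_full, cp = 0.02194488)

rpart.plot(rt_pruned)  # visualize the pruned, 4-split tree
```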
This is now a much more parsimonious model!
We now return to the click data set from earlier (with predictors area_income and age) to dive deeper into these differences. Let's consider all possible splits on the variable area_income: